-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
ENH: Add lazy copy for take and between_time #50476
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
pandas/core/generic.py
Outdated
@@ -3779,6 +3779,8 @@ def _take( | |||
|
|||
See the docstring of `take` for full explanation of the parameters. | |||
""" | |||
if axis == 0 and np.array_equal(indices, np.arange(0, len(self))): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we do a quick check of how costly this is compared to the actual take
? (to ensure we don't introduce a performance regression for the other cases)
Maybe could also first check if the length of indices
is equal to len(self)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe could also first check if the length of
indices
is equal tolen(self)
np.array_equal
actually already does that, so doing that ourselves is not needed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I profiled this, the cost is reduced if the DataFrame gets larger.
The actual problem is a bit different though: array_equal has no early exit, even if the first value is different, it checks the whole array (wanted to bring this up tomorrow as well). We should probably write something ourselves, because I guess we will need that a couple of times.
Switched to array_equal_fast here |
Co-authored-by: Joris Van den Bossche <[email protected]>
and array_equal_fast( | ||
indices, | ||
np.arange(0, len(self), dtype=np.intp), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Further idea to potentially speed this up: right now we only use array_equal_fast
to check an array against a standard 0...n indexer, I think?
For that specific case, we don't actually need to create this second array, but could just use the iteration variable inside array_equal_fast
(it only needs the length).
(now, I don't know if we would want to start using array_equal_fast
for other cases)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(that also avoids creating this additional array even for the fast cases where the length doesn't even match)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I’ll take a look at this in a follow up if ok to check how big the performance improvement would be
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure
@@ -481,6 +481,39 @@ def test_assign_drop_duplicates(using_copy_on_write, method): | |||
tm.assert_frame_equal(df, df_orig) | |||
|
|||
|
|||
@pytest.mark.parametrize("obj", [Series([1, 2]), DataFrame({"a": [1, 2]})]) | |||
def test_take(using_copy_on_write, obj): | |||
obj_orig = obj.copy() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Small nitpick: can you add a comment here that is testing the corner case of taking all rows? (because in general take always by definition returns a copy)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added
Co-authored-by: Joris Van den Bossche <[email protected]>
doc/source/whatsnew/vX.X.X.rst
file if fixing a bug or adding a new feature.